Classifying Neutrino Flavour in the IceCube Experiment¶

PHYS-3220 Project Part 2¶

Joseph Cuzzupoli¶

This notebook takes the reader through the IceCube Neutrino Observatory dataset presented by Particle Physics Playground. The dataset is a record of 10 events of neutrino flux observations made by IceCube on Earth, which quantifies the parameters of optical signal and cartesian coordinates associated with the various detection events. The detector itself monitors light emission due to Cherenkov radiation generated by interactions of high energy neutrinos with atoms in the ice. When neutrinos pass through the observatory and induce interactions, relativistic elementary particles are released. Within the ice, these elementary particles can move faster than the local speed of light. When this happens, similar to when airplanes move at mach speed and produce a sonic boom, these faster-than-light particles create light booms known as Cherenkov radiation. Detectors, that are lined within the observatory, made of boroscilicate glass serve as photosensitive modules that detect this radiation. Over the course of 10 separate detection events, recording light generated by different flavours of neutrinos, the entire dataset is compiled. This investigation seeks to be able to classify neutrino flavour based on parameters such as optical signal and frequency of detected hits during an event. What can be deduced from the contents of this analysis is that the results suggest that this is possible, but would of course require a more thorough analysis to conclude.

Libraries and Data Load-In¶

This cell loads in the neccessary libraries such as pandas, seaborn, sklearn modules, etc. These libraries are used extensivley throughout this notebook

In [2]:
#CITE: Particle Physics Playground
import sys
import matplotlib
import matplotlib.path as mpath
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure, show, axis
from matplotlib.patches import Ellipse
import scipy
import numpy as np
from numpy import *
import math
import pylab
import random
from pylab import *
from astropy.coordinates import SkyCoord
from astropy import units as u
import pylab as P
from astropy.io import ascii
import pandas as pd
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans, k_means
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso, LassoCV
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import warnings
from matplotlib import MatplotlibDeprecationWarning

warnings.filterwarnings(
    "ignore",
    category=MatplotlibDeprecationWarning
)


matplotlib.style.use('ggplot')

sns.set_style('darkgrid')

%matplotlib inline
%config InlineBackend.figure_format ='retina'

This cell reads in the data. This step is provided by the Particle Physics Playground. An addition made to this is converting the ratio of "hits per parameter" into units by engineering columns. I also turn everything into one large dataframe to monitor the neutrino hits in all events simulteneously.

In [3]:
# IMPORT DATA STEP (This step is provided by the Particle Physics Playground)

import pandas as pd
import numpy as np
import scipy.stats as stats
import csv
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

!pip install -q git+https://github.com/mattbellis/h5hep.git
!pip install -q git+https://github.com/mattbellis/particle_physics_simplified.git

import pps_tools as pps
import h5hep
import numpy as np
import matplotlib.pyplot as plt

import plotly.express as px

infilename = 'icecube_events_small.h5'

pps.download_from_drive('icecube_events_small.h5')
infilename = 'data/' + infilename

all_data = pps.get_all_data(infilename)

# This fills 'event' with one entry from 'data'
event = pps.get_icecube_event(all_data, 0)
df_event = pd.DataFrame(event)


df_event["q"] = df_event["hits/nhits"]/df_event["hits/q"]
df_event["t"] = df_event["hits/nhits"]/df_event["hits/t"]
df_event["x"] = df_event["hits/nhits"]/df_event["hits/x"]
df_event["y"] = df_event["hits/nhits"]/df_event["hits/y"]
df_event["z"] = df_event["hits/nhits"]/df_event["hits/z"]

df_event.head()
Loading in the data...

Building the indices...
Built the indices!
Data is read in and input file is closed.
Out[3]:
_SINGLETON_/INDEX hits/nhits hits/q hits/t hits/x hits/y hits/z q t x y z
0 1 844 1.025 20513.0 -9.130000 -481.739990 158.820007 823.414653 0.041145 -92.442496 -1.751982 5.314192
1 1 844 0.375 20538.0 -9.130000 -481.739990 158.820007 2250.666667 0.041095 -92.442496 -1.751982 5.314192
2 1 844 0.925 8724.0 114.389999 -461.989990 498.660004 912.432421 0.096745 7.378267 -1.826879 1.692536
3 1 844 0.825 10320.0 361.000000 -422.829987 414.410004 1023.030318 0.081783 2.337950 -1.996074 2.036630
4 1 844 0.425 11294.0 -211.350006 -404.480011 498.630005 1985.882297 0.074730 -3.993376 -2.086630 1.692638

This section of the code initializes the df_all_events dataframe which is the amalgamation of all the events into one frame. This allows us to check for different neutrinos that are observed in all the events and cross reference them against each other directly

In [4]:
#CITE: chatGPT assistance
all_events = []
for i in range(10):
  #Load in event i from the entire event library
    df_event = pd.DataFrame(pps.get_icecube_event(all_data, i))
  #Engineer the non-ratio features (i.e. get rid of hits per unit parameter)
    df_event["q"] = df_event["hits/nhits"] / df_event["hits/q"]
    df_event["t"] = df_event["hits/nhits"] / df_event["hits/t"]
    df_event["x"] = df_event["hits/nhits"] / df_event["hits/x"]
    df_event["y"] = df_event["hits/nhits"] / df_event["hits/y"]
    df_event["z"] = df_event["hits/nhits"] / df_event["hits/z"]
#Index the event number
    df_event["event"] = i + 1
#Make a master list
    all_events.append(df_event)
#Concatenate the master list into a master dataframe that include the data of all events 1-10
df_all_events = pd.concat(all_events, ignore_index=True)
#Use the groupby feature to make it easy to group by the event and calculate the mean and standard deviation for each event optical parameter and put it in a lis
means = df_all_events.groupby("event")["hits/q"].mean().tolist()
stds = df_all_events.groupby("event")["hits/q"].std().tolist()
ind = df_all_events.groupby("event")["q"].mean().tolist()
st = df_all_events.groupby("event")["q"].std().tolist()
index = list(range(1, 11))

The columns on the far right are engineered from the number of hits and the ratio of hits. How this was done is by dividing through with the specific ratio parameter (i.e. hits/z) and then multiplying by the hits value corresponding to that event. In this way, one can transform into a different unit space and examine a more physical parameter such as z or q. This ended up not being too important but served as an interesting lens to exmaine the data

Initial Exploratory Data Analysis¶

With the data read in, we can now explore some initial plots and decide how to approach the analysis. There are a few things we are looking for. Since we want to classify the flavour of the neutrino based on the optical parameter q as well as the hits per unit time, we need to be able to;

  • Show clear differences in the optical parameter across events
  • Show clear differences in the hits per unit time across events

Why are these 2 items important. These parameters are the quantifiers of the relative energy of the products created during fundamental particle interactions. The optical parameter, according to simulations, has shown to have distint signatures that are like fingerprints for the specific neutrino flavour. In other words, one can directly determine the flavour of the neutrino from its unique optical signal it produces. With this, one would instincitvley search for these high energy events in the data (which is shown below).

The first thing we'd like to do is really understand the data and how it is organized. The cell below checks for null values and performs simple deductive statistics on the data.

In [5]:
#CITE: Homework2 data examination
event = pps.get_icecube_event(all_data,0)
event_data = pd.DataFrame(event)
event_data.info()
print(event_data.describe())
print("The shape of one event of IceCube data (in this case event 0) is ",event_data.shape)
nevents = pps.get_number_of_entries(all_data)
print("There are {0} events in this dataset".format(nevents))
print("The number of sample points for this particular event is ", event_data.size)
mean_hitsq = event_data.iloc[:,2].mean()
median_hitsq = event_data.iloc[:,2].median()
mode_hitsq = event_data.iloc[:,2].mode()
std_hitsq = event_data.iloc[:,2].std()
var_hitsq = event_data.iloc[:,2].var()
print("Some statistics that can be examined for the hits-q (which is a measure of the optical signal) are as follows, ")
print("Mean:",mean_hitsq)
print("Median:",median_hitsq)
print("Mode:",mode_hitsq[0])
print("Standard Deviation:",std_hitsq)
print("Variance:",var_hitsq)
print("___________________________________________")
print("For the df_all_events dataframe...")

print(df_all_events.describe().T)
group=df_all_events.groupby(['event','hits/nhits'])
group.mean()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 844 entries, 0 to 843
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   _SINGLETON_/INDEX  844 non-null    int64  
 1   hits/nhits         844 non-null    int64  
 2   hits/q             844 non-null    float32
 3   hits/t             844 non-null    float32
 4   hits/x             844 non-null    float32
 5   hits/y             844 non-null    float32
 6   hits/z             844 non-null    float32
dtypes: float32(5), int64(2)
memory usage: 29.8 KB
       _SINGLETON_/INDEX  hits/nhits      hits/q        hits/t      hits/x  \
count              844.0       844.0  844.000000    844.000000  844.000000   
mean                 1.0       844.0    0.983021  13589.759766   17.928196   
std                  0.0         0.0    1.058903   2464.109131  108.427231   
min                  1.0       844.0    0.023053   6008.000000 -481.600006   
25%                  1.0       844.0    0.475000  12582.094727  -10.970000   
50%                  1.0       844.0    0.875000  13210.000000   31.250000   
75%                  1.0       844.0    1.175000  13657.000000   72.370003   
max                  1.0       844.0   14.921871  25479.000000  576.369995   

           hits/y      hits/z  
count  844.000000  844.000000  
mean    -8.074780 -237.688004  
std     74.948410  208.913879  
min   -481.739990 -503.630005  
25%    -60.470001 -361.149994  
50%      6.720000 -290.160004  
75%     27.090000 -197.880005  
max    463.720001  505.720001  
The shape of one event of IceCube data (in this case event 0) is  (844, 7)
There are 10 events in this dataset
The number of sample points for this particular event is  5908
Some statistics that can be examined for the hits-q (which is a measure of the optical signal) are as follows, 
Mean: 0.9830211400985718
Median: 0.875
Mode: 1.025
Standard Deviation: 1.0589032173156738
Variance: 1.1212760210037231
___________________________________________
For the df_all_events dataframe...
                     count          mean          std          min  \
_SINGLETON_/INDEX  12414.0      1.000000     0.000000     1.000000   
hits/nhits         12414.0   1422.352183   488.859518   697.000000   
hits/q             12414.0      1.006837     1.271385     0.023053   
hits/t             12414.0  13110.463867  2693.843994  5629.000000   
hits/x             12414.0      3.634275   147.579620  -570.900024   
hits/y             12414.0    -38.538662   169.346558  -521.080017   
hits/z             12414.0   -171.131622   244.023712  -510.570007   
q                  12414.0   2564.506975  3036.542128    11.133595   
t                  12414.0      0.111840     0.042349     0.025048   
x                  12414.0    -18.440181   104.314642  -236.473162   
y                  12414.0     41.530080    91.870028  -138.397433   
z                  12414.0     -0.343303    28.305936  -242.857133   
event              12414.0      5.906235     2.746974     1.000000   

                            25%           50%           75%           max  
_SINGLETON_/INDEX      1.000000      1.000000      1.000000      1.000000  
hits/nhits           844.000000   1502.000000   1799.000000   2159.000000  
hits/q                 0.525000      0.877201      1.197482     62.603321  
hits/t             11940.829102  12807.194824  13465.000000  32831.429688  
hits/x               -10.970000     35.540001     79.410004    576.369995  
hits/y               -79.500000      6.720000     35.490002    509.500000  
hits/z              -350.059998   -227.089996      4.460000    512.950012  
q                   1018.153601   1629.433904   2789.576106  93652.764133  
t                      0.068302      0.116846      0.143110      0.380978  
x                    -65.503643      7.368142     19.880650   1262.573071  
y                    -11.443810      8.259871     49.780956    618.624640  
z                     -6.432273     -3.812780      1.659338    449.700610  
event                  4.000000      6.000000      9.000000     10.000000  
Out[5]:
_SINGLETON_/INDEX hits/q hits/t hits/x hits/y hits/z q t x y z
event hits/nhits
1 844 1.0 0.983022 13589.749023 17.928259 -8.074822 -237.687881 1610.596205 0.063757 -19.523140 42.385751 -1.935380
2 800 1.0 0.990026 13571.052734 19.517662 -8.888537 -229.903732 1534.719519 0.060555 -15.512202 35.838132 -1.817593
3 917 1.0 0.928878 13113.987305 -59.272289 50.967949 -226.473145 1584.537348 0.072948 -1.330183 20.971932 1.582592
4 1302 1.0 0.989285 13204.146484 -44.376842 38.416965 -201.068817 2383.310279 0.102313 -19.111217 48.753177 0.174137
5 2159 1.0 0.948680 13380.733398 24.405081 -7.232395 -214.480164 3936.944966 0.165799 -33.334319 81.130253 -4.030101
6 1560 1.0 0.943738 12960.739258 -17.750942 38.081303 -180.701965 2738.057229 0.123935 -9.371805 36.010840 0.466757
7 697 1.0 1.426697 11469.996094 7.771450 -394.524750 165.350250 1135.115307 0.062160 -24.704954 2.485091 5.097271
8 834 1.0 0.912425 14003.599609 75.083023 -61.713070 -229.410278 1508.803047 0.062309 7.824406 0.688895 -1.717514
9 1799 1.0 0.957915 13198.725586 -11.395954 33.099060 -208.950974 3315.569639 0.140868 -7.788139 55.207786 0.172943
10 1502 1.0 1.157328 12438.828125 35.930401 -225.118484 -48.943596 2601.278090 0.123837 -41.699772 23.595114 1.790371

With this data, we are able to actually reconstruct the event and make a 3D interactive plot of the event detection that occurred. This is supplied by Particle Physics Playground. and shown below

In [6]:
#CITE Particle Physics Playground introduction to the IceCube experiment notebook
fig = pps.display_icecube_event(event_data)
fig
#
here
In [7]:
#CITE: https://python-graph-gallery.com/11-grouped-barplot/
#This visualizes the standard deviation and means of the optical parameter and
#hits per unit time parameter across each of the events
stds = []
means = []
index = []

ind = []
st = []

for i in range(0,10):
  #Loops over all of the events storred in the larger data library
  df_event = pd.DataFrame(pps.get_icecube_event(all_data, i))
  #Engineering the non hit-frequency values (i.e. getting rid of the hits per parameter and turning into real units)
  df_event["q"] = df_event["hits/nhits"]/df_event["hits/q"]
  df_event["t"] = df_event["hits/nhits"]/df_event["hits/t"]
  df_event["x"] = df_event["hits/nhits"]/df_event["hits/x"]
  df_event["y"] = df_event["hits/nhits"]/df_event["hits/y"]
  df_event["z"] = df_event["hits/nhits"]/df_event["hits/z"]
  #Taking the mean of hits per optical signal parameter and the optical parameter alone
  means.append(pd.DataFrame(pps.get_icecube_event(all_data, i))['hits/q'].mean())
  stds.append(pd.DataFrame(pps.get_icecube_event(all_data, i))['hits/q'].std())
  ind.append(df_event["q"].mean())
  st.append(df_event["q"].std())
  index.append(i+1)
#CITE AI USAGE - this block simply spaces out the bars of the mean and standard deviation bar plots so that they can become a grouped bar plot.
barWidth = 0.4
r = np.arange(len(means))
r2 = r + barWidth
r3 = r2 + barWidth
###
fig, ax = plt.subplots(dpi=100)
ax.bar(r, means, color='red', width=barWidth, edgecolor='white', label='Optical Signal Mean')
ax.bar(r2, stds, color='blue', width=barWidth, edgecolor='white', label='Optical Signal std')
#Labelling the plot and making sure the labels are large and readable
ax.set_xlabel('Event Number', fontweight='bold', fontsize=15)
ax.set_ylabel('Arbitray Signal Units', fontweight='bold', fontsize=15)
ax.set_xticks(r + barWidth)
ax.set_xticklabels(index, fontsize=15)
ax.tick_params(axis='y', labelsize=15)
plt.title('Optical Signal Mean and Std for All Events', fontweight='bold', fontsize=15)
ax.legend(fontsize=15)
plt.show()
#This is a repeat of above, but just using the hits/t column data instead
tstds = []
tmeans = []
tindex = []

tind = []
tst = []

for i in range(0,10):
  df_event = pd.DataFrame(pps.get_icecube_event(all_data, i))
  df_event["q"] = df_event["hits/nhits"]/df_event["hits/q"]
  df_event["t"] = df_event["hits/nhits"]/df_event["hits/t"]
  df_event["x"] = df_event["hits/nhits"]/df_event["hits/x"]
  df_event["y"] = df_event["hits/nhits"]/df_event["hits/y"]
  df_event["z"] = df_event["hits/nhits"]/df_event["hits/z"]
  tmeans.append(pd.DataFrame(pps.get_icecube_event(all_data, i))['hits/t'].mean())
  tstds.append(pd.DataFrame(pps.get_icecube_event(all_data, i))['hits/t'].std())
  tind.append(df_event["q"].mean())
  tst.append(df_event["q"].std())
  tindex.append(i+1)
#CITE AI USAGE - this block simply spaces out the bars of the mean and standard deviation bar plots so that they can become a grouped bar plot.
barWidth = 0.4
r = np.arange(len(means))
r2 = r + barWidth
r3 = r2 + barWidth
###
fig, ax = plt.subplots(dpi=100)
ax.bar(r, tmeans, color='red', width=barWidth, edgecolor='white', label='Hits Per Time Mean')
ax.bar(r2, tstds, color='blue', width=barWidth, edgecolor='white', label='Hits per Time std')

ax.set_xlabel('Event Number', fontweight='bold', fontsize=14)
ax.set_ylabel('Arbitray Signal Units', fontweight='bold', fontsize=14)
ax.set_xticks(r + barWidth)
ax.set_xticklabels(tindex, fontsize=14)
ax.tick_params(axis='y', labelsize=14)


plt.title('Hits per Time Mean and Std for All Events', fontweight='bold', fontsize=15)
ax.legend()
plt.show()

These plots show that for the mean and standard deviation of the optical parameter measurement, it shows a clear spike in optical intensity in the events 7 and 10. This may indiciate a difference in the neutrino flavour that caused them! Let's examine further and see what else we can identify

This plot visualizes the histogram distrubutions of the optical parameter measurement amongst individual events to examine the spread of the optical signal in the IceCube experiment. This it applied for both the hits/q value which is provided by Particle Physics Playground as well as an engineered optical measurement column (q) which I designed by dividing out the number of hits from nhits described above.

In [8]:
###CITE IceCube experiment notebook for the histogram plot layout
plt.figure(figsize=(20,8))
for i in range(0,10):
  plt.subplot(2, 5, i + 1)
  event = pps.get_icecube_event(all_data,i)
  df_event = pd.DataFrame(pps.get_icecube_event(all_data, i))
  df_event["q"] = df_event["hits/nhits"]/df_event["hits/q"]
  df_event["t"] = df_event["hits/nhits"]/df_event["hits/t"]
  df_event["x"] = df_event["hits/nhits"]/df_event["hits/x"]
  df_event["y"] = df_event["hits/nhits"]/df_event["hits/y"]
  df_event["z"] = df_event["hits/nhits"]/df_event["hits/z"]
  q = event['hits/q']
  q_mean = np.mean(q)
  q_std = np.std(q)
  #plt.title("Event:"+str(i+1), fontsize=16)
  plt.hist(q,bins=75,label=f"Mean:{q_mean:.2f} Std:{q_std:.2f}")
  plt.xticks(rotation=35, fontsize=15)
  plt.yticks(rotation=35, fontsize=15)
  plt.legend(fontsize=12)
plt.tight_layout(rect=[0.05, 0.05, 0.95, 0.95])
plt.figtext(0.5, -0.03, 'Optical Measurement (q)', ha='center', fontsize=18, fontweight='bold')
plt.figtext(-0.03, 0.5, 'Frequency', va='center', rotation='vertical', fontsize=18, fontweight='bold')
plt.figtext(0.3, 1.05, 'hits/q Optical Parameter Histograms for Events 1-10', va='center', fontsize=25, fontweight='bold')


plt.tight_layout()
###
print(







)

###CITE IceCube experiment notebook for the histogram plots layout
plt.figure(figsize=(20,8))

for i in range(0,10):
  plt.subplot(2, 5, i + 1)
  event = pps.get_icecube_event(all_data,i)
  df_event = pd.DataFrame(pps.get_icecube_event(all_data, i))
  df_event["q"] = df_event["hits/nhits"]/df_event["hits/q"]
  df_event["t"] = df_event["hits/nhits"]/df_event["hits/t"]
  df_event["x"] = df_event["hits/nhits"]/df_event["hits/x"]
  df_event["y"] = df_event["hits/nhits"]/df_event["hits/y"]
  df_event["z"] = df_event["hits/nhits"]/df_event["hits/z"]
  #plt.title(f"Event: {i+1}", fontsize=16)
  plt.hist(df_event['q'],bins=75,label=f"Mean: {df_event['q'].mean():.2f} Std: {df_event['q'].std():.2f}", color = 'blue');
  plt.xticks(rotation=35, fontsize=13)
  plt.yticks(rotation=35, fontsize=13)
  plt.legend(fontsize=12)

plt.tight_layout(rect=[0.05, 0.05, 0.95, 0.95])
plt.figtext(0.5, -0.03, 'Optical Measurement (q)', ha='center', fontsize=20, fontweight='bold')
plt.figtext(-0.03, 0.5, 'Frequency', va='center', rotation='vertical', fontsize=20, fontweight='bold')
plt.figtext(0.3, 1.05, 'q Optical Parameter Histograms for Events 1-10', va='center', fontsize=25, fontweight='bold')

plt.tight_layout()
###
###

These plots show the scatters of the hits/q parameter against other variables such as hits/t, hits/x,y,z etc. It also includes a heatmap of the number of detected hits during the event which is plotted along with the data in the scatter plot

In [9]:
 #AI assistance: initializing figure and creating a subplot
fig2=plt.figure(2)
ax=fig2.add_subplot(111)
###
choice = [3,7] #Choice of the events that I wish to look at
for i in choice:
  #Make a scatter of hits/q as a function of hits/y
  plt.scatter(pd.DataFrame(pps.get_icecube_event(all_data, i))['hits/y'], pd.DataFrame(pps.get_icecube_event(all_data, i))['hits/q'],marker='o',s=40,c=pd.DataFrame(pps.get_icecube_event(all_data, i))['hits/t'],label=f'Event {i}', alpha=0.8)
cbar=plt.colorbar()
ax.set_xlabel(r'$\mathrm{hits/y}$',fontsize=18)
ax.set_ylabel(r'$\mathrm{hits/q}$',fontsize=18)
cbar.set_label(r'$\mathrm{hits/t}$',fontsize=18)
#ax.set_ylim(0, 2.50)  # **Set y-axis limits**chatgpt
plt.legend()
plt.title('Colour Map of hits/t Overlaid on Scatter of hits/q as a function of hits/y)')
plt.show()
#Same plot as above, just zoomed in
fig2=plt.figure(2)
ax=fig2.add_subplot(111)
choice = [3,7] #Choice of the events I want to look at
for i in choice:
  #potting the hits/q as a function of hits/y
  plt.scatter(pd.DataFrame(pps.get_icecube_event(all_data, i))['hits/y'], pd.DataFrame(pps.get_icecube_event(all_data, i))['hits/q'],marker='o',s=40,c=pd.DataFrame(pps.get_icecube_event(all_data, i))['hits/t'],label=f'Event {i}', alpha=0.8)
cbar=plt.colorbar()
ax.set_xlabel(r'$\mathrm{hits/y}$',fontsize=18)
ax.set_ylabel(r'$\mathrm{hits/q}$',fontsize=18)
cbar.set_label(r'$\mathrm{hits/t}$',fontsize=18)
ax.set_ylim(0, 2.50)  # **Set y-axis limits**chatgpt
plt.legend()
plt.title('Colour Map of hits/t Overlaid on Scatter of hits/q as a function of hits/y (Zoomed in)')
plt.show()

#################################################
fig2=plt.figure(2)
ax=fig2.add_subplot(111)
choice = [3,7]
for i in choice:
  df_event = pd.DataFrame(pps.get_icecube_event(all_data, i))
  df_event["q"] = df_event["hits/nhits"]/df_event["hits/q"]
  df_event["t"] = df_event["hits/nhits"]/df_event["hits/t"]
  df_event["x"] = df_event["hits/nhits"]/df_event["hits/x"]
  df_event["y"] = df_event["hits/nhits"]/df_event["hits/y"]
  df_event["z"] = df_event["hits/nhits"]/df_event["hits/z"]
  plt.scatter(df_event['y'], df_event['q'],marker='o',s=40,c=df_event['t'],label=f'Event {i}', alpha=0.8)
cbar=plt.colorbar()
ax.set_xlabel(r'$\mathrm{y}$',fontsize=18)
ax.set_ylabel(r'$\mathrm{q}$',fontsize=18)
cbar.set_label(r'$\mathrm{t}$',fontsize=18)
#ax.set_ylim(0, 2.50)  # **Set y-axis limits** chatchpt
plt.legend()
plt.title('Colour Map of t Overlaid on Scatter of q as a function of y)')
plt.show()

fig2=plt.figure(2)
ax=fig2.add_subplot(111)
choice = [3,7]
for i in choice:
  df_event = pd.DataFrame(pps.get_icecube_event(all_data, i))
  df_event["q"] = df_event["hits/nhits"]/df_event["hits/q"]
  df_event["t"] = df_event["hits/nhits"]/df_event["hits/t"]
  df_event["x"] = df_event["hits/nhits"]/df_event["hits/x"]
  df_event["y"] = df_event["hits/nhits"]/df_event["hits/y"]
  df_event["z"] = df_event["hits/nhits"]/df_event["hits/z"]
  plt.scatter(df_event['y'], df_event['q'], marker='o', s=40, c=df_event['t'],label=f'Event {i}', alpha=0.8)
cbar=plt.colorbar()
ax.set_xlabel(r'$\mathrm{y}$',fontsize=18)
ax.set_ylabel(r'$\mathrm{q}$',fontsize=18)
cbar.set_label(r'$\mathrm{t}$',fontsize=18)
plt.legend()
ax.set_xlim(-75, 75)  # **Set y-axis limits** Chatgpt
ax.set_ylim(0, 10000)  # **Set y-axis limits** chatgpt

plt.title('Colour Map of t Overlaid on Scatter of q as a function of y (Zoomed in)')
plt.show()

The plots above are looking at events 7 and 3 to compare the differences in the optical parameter and hits/t parameter. There is a large density of points with some extreme outliers. If we zoom into the high density regions we can see some natural clustering of the points.

In [10]:
#Same functional layout of the code as the plots above, they just look at different plotted parameters
fig2=plt.figure(2)
ax=fig2.add_subplot(111)
choice = [3,7]
for i in choice:
  plt.scatter(pd.DataFrame(pps.get_icecube_event(all_data, i))['hits/t'], pd.DataFrame(pps.get_icecube_event(all_data, i))['hits/q'],marker='o',s=40,c=pd.DataFrame(pps.get_icecube_event(all_data, i))['hits/nhits'],label=f'Event {i}', alpha=0.8)
cbar=plt.colorbar()
plt.legend()
ax.set_xlabel(r'$\mathrm{hits/t}$',fontsize=18)
ax.set_ylabel(r'$\mathrm{hits/q}$',fontsize=18)
cbar.set_label(r'$\mathrm{hits/nhits}$',fontsize=18)
#ax.set_ylim(0, 2.50)
plt.legend()
plt.title('Colour Map of hits/nhits Overlaid on Scatter of hits/q as a function of hits/t')
plt.show()


fig2=plt.figure(2)
ax=fig2.add_subplot(111)
choice = [3,7]
for i in choice:
  plt.scatter(pd.DataFrame(pps.get_icecube_event(all_data, i))['hits/t'], pd.DataFrame(pps.get_icecube_event(all_data, i))['hits/q'],marker='o',s=40,c=pd.DataFrame(pps.get_icecube_event(all_data, i))['hits/nhits'],label=f'Event {i}', alpha=0.8)
cbar=plt.colorbar()
plt.legend()
ax.set_xlabel(r'$\mathrm{hits/t}$',fontsize=18)
ax.set_ylabel(r'$\mathrm{hits/q}$',fontsize=18)
cbar.set_label(r'$\mathrm{hits/nhits}$',fontsize=18)
ax.set_ylim(0, 2.50)
plt.legend()
plt.title('Colour Map of hits/nhits Overlaid on Scatter of hits/q as a function of hits/t (ZOOMED IN)')
plt.show()



fig2=plt.figure(2)
ax=fig2.add_subplot(111)
choice = [3,7]
for i in choice:
  df_event = pd.DataFrame(pps.get_icecube_event(all_data, i))
  df_event["q"] = df_event["hits/nhits"]/df_event["hits/q"]
  df_event["t"] = df_event["hits/nhits"]/df_event["hits/t"]
  df_event["x"] = df_event["hits/nhits"]/df_event["hits/x"]
  df_event["y"] = df_event["hits/nhits"]/df_event["hits/y"]
  df_event["z"] = df_event["hits/nhits"]/df_event["hits/z"]
  plt.scatter(df_event['t'], df_event['q'],marker='o',s=40,c=df_event['hits/nhits'],label=f'Event {i}', alpha=0.8)
cbar=plt.colorbar()
ax.set_xlabel(r'$\mathrm{t}$',fontsize=18)
ax.set_ylabel(r'$\mathrm{q}$',fontsize=18)
cbar.set_label(r'$\mathrm{nhits}$',fontsize=18)
#ax.set_ylim(0, 2.50)
plt.legend()
plt.title('Colour Map of t Overlaid on Scatter of q as a function of nhits)')
plt.show()

fig2=plt.figure(2)
ax=fig2.add_subplot(111)
choice = [3,7]
for i in choice:
  df_event = pd.DataFrame(pps.get_icecube_event(all_data, i))
  df_event["q"] = df_event["hits/nhits"]/df_event["hits/q"]
  df_event["t"] = df_event["hits/nhits"]/df_event["hits/t"]
  df_event["x"] = df_event["hits/nhits"]/df_event["hits/x"]
  df_event["y"] = df_event["hits/nhits"]/df_event["hits/y"]
  df_event["z"] = df_event["hits/nhits"]/df_event["hits/z"]
  plt.scatter(df_event['t'], df_event['q'], marker='o', s=40, c=df_event['hits/nhits'],label=f'Event {i}', alpha=0.8)
cbar=plt.colorbar()
ax.set_xlabel(r'$\mathrm{t}$',fontsize=18)
ax.set_ylabel(r'$\mathrm{q}$',fontsize=18)
cbar.set_label(r'$\mathrm{nhits}$',fontsize=18)
plt.legend()
ax.set_xlim(0, 0.15)
ax.set_ylim(0, 12000)

plt.title('Colour Map of t Overlaid on Scatter of q as a function of y (Zoomed in)')
plt.show()

We can also look at some box plots and violin plots which reflect the number of outliers in the data. One will be able to see an abundance of extreme values in the optical parameter that may indicate high-energy interactions within the ice. This will be a point of examination.

In [11]:
fig1 = plt.figure(figsize=(9,4)) #initialize the plot
ax = fig1.gca()
#Make a boxplot of the hits/q data to examine the abundance of outliers in the data and the distrubution of optical signal values across events
ax = sns.boxplot(x="event", y="hits/q", data=df_all_events,notch=True, saturation=0.5, medianprops=dict(color='red'), linewidth=2) #CITE: Chatgpt: Changing the median line inside the interquartile box so it is more noticeable
ax.set_title('Hits/q Boxplot', fontweight='bold', fontsize=20)
ax.set_xlabel('Event Number', fontweight='bold', fontsize=20)
ax.set_ylabel('Arbitray Signal Units', fontweight='bold', fontsize=20)
ax.tick_params(axis='x', labelsize=18)
ax.tick_params(axis='y', labelsize=18)

plt.show()
#Make a violin of the hits/q data to examine the abundance of outliers in the data and the distrubution of optical signal values across events
fign = plt.figure(figsize=(9,4))
ax = fign.gca()
sns.violinplot(x="event", y="hits/q", data=df_all_events, color="skyblue", inner="quartile") #CITE: AI Assistance: Creating a violin plot to similarly showcase the abundance of outlier values; very similar to the box plot
ax.set_title("Hist/q Violin Plot", fontweight='bold', fontsize=16)
ax.set_xlabel('Event Number', fontweight='bold', fontsize=14)
ax.set_ylabel('Arbitray Signal Units', fontweight='bold', fontsize=14)
ax.tick_params(axis='y', labelsize=14)
ax.tick_params(axis='x', labelsize=14)

plt.show()

Correlation¶

We will not try to visualize the correlation between the parameters in the dataset using the pearson correlation matrix given by; $$ \text{pearson correlation}\;r = cor(X, Y) =\frac{cov(X, Y)}{std(X)std(Y)}$$This method will allow us to quantify how linearly related or not different parameters of the dataset are from one another. We will also plot it using a correlation heatmap and then tighten the scope of the analysis to only look at the more relevant parameters

In [12]:
corr = df_all_events.corr() #Calculates the correlation of the variables in the data set

f, ax = plt.subplots(figsize=(8, 8))
#Visualizes it using the seaborn library heatmap plot tool
ax = sns.heatmap(data=corr,  annot=True, fmt=".2f", square=True, ax=ax)
plt.title('Correlation Heatmap',size=16)
ax.set_xticklabels(ax.xaxis.get_ticklabels(), fontsize=10, rotation=45)
ax.set_yticklabels(ax.yaxis.get_ticklabels(), fontsize=10, rotation=0)
plt.show()
In [13]:
params = ['hits/q', 'hits/t', 'hits/x', 'hits/y', 'hits/z'] #reduces the scope of the plot to the most important and defining variables

corr = df_all_events[params].corr()

f, ax = plt.subplots(figsize=(8, 8))

sns.set(font_scale=1.5)
ax = sns.heatmap(data=corr,  annot=True, fmt=".2f", square=True, ax=ax)
plt.title('Correlation Matrix Heatmap',size=25)
ax.set_xticklabels(ax.xaxis.get_ticklabels(), fontsize=20, rotation=45)
ax.set_yticklabels(ax.yaxis.get_ticklabels(), fontsize=20, rotation=0)

plt.show()

What this shows is that the data is not strongly linearly correlated. However, there are some interesting variables which show a strong anti-linear correlation. hits/x and hits/z, and hits/t and hits/z as well as hits/y and hits/z show anti-linear correlation. Let's quickly visualize hits/x and hits/z to show this relationship

In [14]:
#More exploratory plots, with the same functional form as above.
fig2=plt.figure(2)
ax=fig2.add_subplot(111)

plt.scatter(df_all_events['hits/z'],df_all_events['hits/x'],marker='o',s=40,c=df_all_events['hits/nhits'],alpha=0.5) #Visualized using a heatmap
cbar=plt.colorbar()
ax.set_xlabel(r'$\mathrm{hits/t}$',fontsize=18)
ax.set_ylabel(r'$\mathrm{hits/q}$',fontsize=18)
cbar.set_label(r'$\mathrm{hits/nhits}$',fontsize=18)
#ax.set_ylim(0, 2.50)
#ax.set_xlim(4200, 15000)
plt.title('Colour Map of hits/nhits Overlaid on Scatter of hits/z as a function of hits/z')
plt.show()

These plots are similar to the ones above, however it plots ALL of the events onto one plot and zooms into the high density features of the data. It shows that there are clusters for the different hits/t values which indicates fast moving particles that are triggering many detectors per unit time. This may also be an indication of a high energy event.

In [26]:
#More exploratory plots using the same functional form as above
fig2=plt.figure(2)
ax=fig2.add_subplot(111)
#This is zoomed in to examine the separation of cluster groups
plt.scatter(df_all_events['hits/t'],df_all_events['hits/q'],marker='o',s=40,c=df_all_events['hits/nhits'],alpha=0.5)
cbar=plt.colorbar()
ax.set_xlabel(r'$\mathrm{hits/t}$',fontsize=18)
ax.set_ylabel(r'$\mathrm{hits/q}$',fontsize=18)
cbar.set_label(r'$\mathrm{hits/nhits}$',fontsize=18)
ax.set_ylim(0, 2.50)
ax.set_xlim(4200, 15000)
plt.title('Colour Map of hits/nhits Overlaid on Scatter of hits/q as a function of hits/t (ZOOMED IN)')
plt.show()


fig2=plt.figure(2)
ax=fig2.add_subplot(111)
#All of the events (0-10) optical data is plotted to observe differences in the optical intensity for all different classes of neutrinos togehter
for i in range(0,10):
  plt.scatter(pd.DataFrame(pps.get_icecube_event(all_data, i))['hits/t'], pd.DataFrame(pps.get_icecube_event(all_data, i))['hits/q'],marker='o',s=40,c=pd.DataFrame(pps.get_icecube_event(all_data, i))['hits/nhits'],label=f'Event {i}', alpha=0.5)
cbar=plt.colorbar()
plt.legend()
ax.set_xlabel(r'$\mathrm{hits/t}$',fontsize=18)
ax.set_ylabel(r'$\mathrm{hits/q}$',fontsize=18)
cbar.set_label(r'$\mathrm{hits/nhits}$',fontsize=18)
ax.set_ylim(0, 10)
#ax.set_xlim(4200, 15000)
plt.legend()
plt.title('Colour Map of hits/nhits Overlaid on Scatter of hits/q as a function of hits/t (ZOOMED IN)')
plt.show()

After examining these pieces of evidence to be able to classify neutrinos based on the optical parameter and the hits/t, I am most interested in the natural clustering in the hits/t parameter which indicates very fast moving particles that trigger many detectors. Thus, below begins the steps to use a Random Forest Regression algorithm to be able to predict the values of the hits/q and hits/t values using a 70/30 train test split

Predictions and Clustering¶

First I will maximize the cross validation accuracy as a function of the number of trees used to predict these parameters

In [16]:
#Takes the data and makes a target and feature matrix
Xkss = df_all_events[[c for c in df_all_events.columns if c not in ['q', 'hits/q', '_SINGLETON_/INDEX', 'event']]]
ykss = df_all_events['hits/q'].values

# Standardize features!!! Very important so that we are comparing quantities on different scales
ss = StandardScaler()
Xsss = ss.fit_transform(Xkss)

n_trees = range(10, 100, 10)
r2_scores = []

for n in n_trees:
    rf = RandomForestRegressor(n_estimators=n, random_state=42) #CITE CHATGPT: Initializes the random forest regression model with n trees that loops each time to find the best cv score
    scores = cross_val_score(rf, Xsss, ykss, cv=10, scoring='r2')
    r2_scores.append(np.mean(scores))

plt.figure(figsize=(8, 5))
plt.plot(n_trees, r2_scores)
plt.xlabel("Number of Trees")
plt.ylabel("Cross-Validation R2 Score")
plt.title("Random Forest Regression Performance vs. Number of Trees")
plt.show()

print(f"Best R2 Score: {max(r2_scores):.4f} at {n_trees[np.argmax(r2_scores)]} trees")
Best R2 Score: 0.6920 at 80 trees
In [17]:
# make a 70/30 train test split: NOTE Xsss is already scaled using the standard scaler - a very important part!
X_train, X_test, y_train, y_test = train_test_split(Xsss, ykss, test_size=0.3, random_state=42)

best_n_trees = 80 #Takes the best tree number from the previous part to optimize the regression model
rf_best = RandomForestRegressor(n_estimators=best_n_trees, random_state=42) #CITE: GPT: Creates the random forest regression model used
rf_best.fit(X_train, y_train) #Fitting to the data using the model creating with the optimized tree number
y_pred = rf_best.predict(X_test) #Prediction of the hits/q parameter using the model trained
score =  rf_best.score(X_test, y_test) #scoring the model on one fold of data with r2
score_CV = cross_val_score(rf_best, Xsss, ykss, cv=10, scoring='r2') #Using a cross validation score across 10 folds
r2 = r2_score(y_test, y_pred) #scoring the model on one fold of data with r2
mse = mean_squared_error(y_test, y_pred) #Mean squared error to compute the average squared residual value

#Plot the predicted fit vs the actual data
plt.figure(figsize=(7, 5))
plt.scatter(y_test, y_pred, alpha=0.6, color='blue', label="Predicted vs Actual")
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], '--r', label="Perfect Fit")  #CITE: GPT: Just makes the diagonal line which represents the perfect fit

plt.xlabel("Actual 'hits/q'", fontweight='bold', fontsize=20)
plt.ylabel("Predicted 'hits/q'", fontweight='bold', fontsize=20)
plt.title(f"Actual vs. Predicted (Best RF Model): {best_n_trees} Trees)", fontweight='bold', fontsize=22)


plt.legend()
plt.show()
print(f"The r2 score for this fit is {score:.2} and has a Mean Squared Error of {mse:.2}")
print(f"The cross value score is {np.mean(score_CV):.2}")
The r2 score for this fit is 0.42 and has a Mean Squared Error of 0.6
The cross value score is 0.69

This section now uses a K-mean clustering to group the data into 3 distinct events correlating to the different neutrino flavours that caused them. As explained in the paper, group 0 clusters corresond with the high energy muon neutrinos, group 1 represents the tau neutrino group and group 2 is the electron neutrino group. This labelling is consistent in both k-mean clustering tasks

In [27]:
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning, module="sklearn")

#CITE PLOT DESIGN FROM: https://seaborn.pydata.org/generated/seaborn.jointplot.html
initial_centroids = np.array([[10700, 0.99], [13000, 1.2], [20000, 1.2]]) #Specifying the location of the centroids based on EDA analysis
dfX=pd.DataFrame({'hits/t': df_all_events['hits/t'], 'hits/q': df_all_events['hits/q']}) #Make a new dataframe to be plotted
# Drop rows with NaN values in either 'hits/t' or 'hits/q' columns
#dfX = dfX.dropna(subset=['hits/t', 'hits/q'])
model = KMeans(n_clusters=3, init=initial_centroids, random_state=42).fit(dfX) #Creates the kmeans clustering with three clusters centred as the location of the centroids
#Predicts the location of the clusters
predicted = model.labels_
centroids = model.cluster_centers_
dfX['predicted'] = predicted
print("Location of centroids: ")
print(centroids)

df_all_events.copy()

#Plots the actual clusters  using the seaborn plotting style of jointplot
dfX['hits/t']=df_all_events['hits/t']#Categorizes the independent variable in the mass events dataframe by the clusters
dfX['hits/q']=df_all_events['hits/q']#Categorizes the dependent variable hits/q in the mass events dataframe by the clusters
cmap = plt.cm.get_cmap('Set1', len(dfX['predicted'].unique())) #Creates a colour map based on the predicted clusters
plts = sns.jointplot(data=df_all_events, x='hits/t', y='hits/q', hue=dfX['predicted'], palette='Set1')
plts.ax_joint.set_xlabel('hits/t', fontsize=22, fontweight='bold')
plts.ax_joint.set_ylabel('hits/q', fontsize=22, fontweight='bold')
plts.ax_joint.tick_params(axis='both', labelsize=18)
plt.figtext(0.6, 1, 'K-Means Clustering Jointplot Using hits/q as a Function of hits/t', ha='center', fontsize=20, fontweight='bold')
plt.yscale('log')

plt.show()
Location of centroids: 
[[1.0676513e+04 1.1690106e+00]
 [1.3158599e+04 9.4351619e-01]
 [2.1319223e+04 1.0510709e+00]]
In [19]:
score = silhouette_score(dfX, predicted, metric='euclidean')#how disernable the groups are
score
Out[19]:
0.6525506774802826

I also do this for the hits/z and hits/q parameters, since one might expect higher energy meuon neutrinos to be able to penetrate deeper into the ice, as opposed to background electron and tau neutrinos

In [28]:
#CITE PLOT DESIGN FROM: https://seaborn.pydata.org/generated/seaborn.jointplot.html
dfXx=pd.DataFrame({'hits/z': df_all_events['hits/z'], 'hits/q': df_all_events['hits/q']})
# Drop rows with NaN values in either 'hits/t' or 'hits/q' columns
#dfX = dfX.dropna(subset=['hits/t', 'hits/q'])
modelz = KMeans(n_clusters=3, init=initial_centroids, random_state=42).fit(dfXx)

#Same functional form as above!
predictedz = modelz.labels_
centroidsz = modelz.cluster_centers_
dfXx['predicted'] = predictedz
print("Location of centroids: ")
print(centroidsz)

df_all_events.copy()

dfXx['hits/z']=df_all_events['hits/z']
dfXx['hits/q']=df_all_events['hits/q']
cmap = plt.cm.get_cmap('Set1', len(dfXx['predicted'].unique()))

plts = sns.jointplot(data=df_all_events, x='hits/z', y='hits/q', hue=dfXx['predicted'], palette='Set1')
plts.ax_joint.set_xlabel('hits/z', fontsize=22, fontweight='bold')
plts.ax_joint.set_ylabel('Log Axis hits/q', fontsize=22, fontweight='bold')
plts.ax_joint.tick_params(axis='both', labelsize=18)
plt.figtext(0.6, 1, 'K-Means Clustering Jointplot Using hits/q as a Function of hits/z', ha='center', fontsize=20, fontweight='bold')
plts.ax_joint.legend(title='Cluster', fontsize=12, title_fontsize=13) #CITE CHATGPT: Makes the legend smaller so it is not blocking points

plt.yscale('log')

plt.show()
Location of centroids: 
[[-184.30653      1.0682168]
 [-376.03217      0.8831614]
 [ 203.79977      1.1407628]]
In [21]:
score = silhouette_score(dfXx, predicted, metric='euclidean') #how disernable the groups are
score
Out[21]:
0.3480709722957436

Subset of All Events Data Into Clusters¶

Separating the large daaset into groups based on the K-means clustering allows us to create 3 subsets of the data that can each have a random forest regression run on it. In this way, not only have we separated out the data presumably corresponding to each neutrino event, but also are able to now predict how each neutrino flavour will behave. This is a step up from the previous regression which just simply tries to predict different physical phenomena all at once.

Clusters Determined by hits/q as a Function of Hits/q¶

This first regression uses the hits/t vs hits/q clustering algorithm results to create subsets, each of which are then predicted using a random forest algorithm

In [22]:
#Same functional form as above!!!: only difference is that it loops over the subsets defined by the clusters
#these subsets correspond to the different neutrino flavours

df_all_events.copy()
df_all_events['cluster'] = predicted
# Makes subsets of the data based on how the k means clustering predicted the groups
cluster_A = df_all_events[df_all_events['cluster']==0]
cluster_B = df_all_events[df_all_events['cluster']==1]
cluster_C = df_all_events[df_all_events['cluster']==2]
groups = [cluster_A, cluster_B, cluster_C]
l=0
#loops over the clusters 0,1 and 2 corresponding to the different flavours determined by kmeans
for i in groups:
  Xkss = i[[c for c in i.columns if c not in ['q', 'hits/q', '_SINGLETON_/INDEX', 'event']]]
  ykss = i['hits/q'].values

  # Standardize features!!! Very important so that we are comparing quantities on different scales
  ss = StandardScaler()
  Xsss = ss.fit_transform(Xkss)
  #To prevent over fitting and poor performance, the first group (i.e. group 0) uses a 40/60 train test split
  #other groups use 70/30
  if l == 0:
    size = 0.6
  else:
    size=0.3
  # make a 70/30 train test split: NOTE Xsss is already scaled using the standard scaler - a very important part!
  X_train, X_test, y_train, y_test = train_test_split(Xsss, ykss, test_size=size, random_state=42)

  best_n_trees = 80
  rf_best = RandomForestRegressor(n_estimators=best_n_trees, random_state=42) #CITE: GPT: instantiates the random forest regression model used
  rf_best.fit(X_train, y_train)
  y_pred = rf_best.predict(X_test)
  score =  rf_best.score(X_test, y_test)
  score_CV = cross_val_score(rf_best, Xsss, ykss, cv=10, scoring='r2')
  r2 = r2_score(y_test, y_pred)
  mse = mean_squared_error(y_test, y_pred)

  plt.figure(figsize=(7, 5))
  plt.scatter(y_test, y_pred, alpha=0.6, color='blue', label="Predicted vs Actual")
  plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], '--r', label="Perfect Fit")  #CITE: GPT: Just makes the diagonal line which represents the perfect fit

  plt.xlabel("Actual 'hits/q'", fontweight='bold', fontsize=20)
  plt.ylabel("Predicted 'hits/q'", fontweight='bold', fontsize=20)
  plt.title(f"Actual vs. Predicted Group {l}: {best_n_trees} Trees)", fontweight='bold', fontsize=22)

  plt.legend()
  plt.tight_layout()
  plt.show()
  print(f"The r2 score for this fit is {score:.2} and has a Mean Squared Error of {mse:.2}")
  print(f"The cross value score is {np.mean(score_CV):.2}")
  print(f"The mean optical signal for this group is {np.mean(y_pred)}")
  l=l+1
The r2 score for this fit is 0.23 and has a Mean Squared Error of 2.9
The cross value score is 0.54
The mean optical signal for this group is 1.1848178403079106
The r2 score for this fit is 0.46 and has a Mean Squared Error of 0.37
The cross value score is 0.76
The mean optical signal for this group is 0.9386202857034965
The r2 score for this fit is 0.41 and has a Mean Squared Error of 0.47
The cross value score is 0.59
The mean optical signal for this group is 1.069292957918956

Clusters Determined by hits/q as a Function of Hits/z¶

This second regression uses the hits/z vs hits/q clustering algorithm results to create subsets, each of which are then predicted using a random forest algorithm

In [23]:
#Same functional form as above, just using the hits/z feature columns instead

df_all_events.copy()
df_all_events['cluster'] = predictedz

cluster_A = df_all_events[df_all_events['cluster']==0]
cluster_B = df_all_events[df_all_events['cluster']==1]
cluster_C = df_all_events[df_all_events['cluster']==2]
groups = [cluster_A, cluster_B, cluster_C]
l=0
for i in groups:
  Xkss = i[[c for c in i.columns if c not in ['q', 'hits/q', '_SINGLETON_/INDEX', 'event']]]
  ykss = i['hits/q'].values

  # Standardize features!!! Very important so that we are comparing quantities on different scales
  ss = StandardScaler()
  Xsss = ss.fit_transform(Xkss)
  if l == 0:
    size = 0.6
  else:
    size=0.3
  # make a 70/30 train test split: NOTE Xsss is already scaled using the standard scaler - a very important part!
  X_train, X_test, y_train, y_test = train_test_split(Xsss, ykss, test_size=size, random_state=42)

  best_n_trees = 80
  rf_best = RandomForestRegressor(n_estimators=best_n_trees, random_state=42) #CITE: GPT: Creates the model used
  rf_best.fit(X_train, y_train)
  y_pred = rf_best.predict(X_test)
  score =  rf_best.score(X_test, y_test)
  score_CV = cross_val_score(rf_best, Xsss, ykss, cv=10, scoring='r2')
  r2 = r2_score(y_test, y_pred)
  mse = mean_squared_error(y_test, y_pred)

  plt.figure(figsize=(7, 5))
  plt.scatter(y_test, y_pred, alpha=0.6, color='blue', label="Predicted vs Actual")
  plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], '--r', label="Perfect Fit")  #CITE: GPT: Just makes the diagonal line which represents the perfect fit

  plt.xlabel("Actual 'hits/q'", fontsize=20, fontweight='bold')
  plt.ylabel("Predicted 'hits/q'", fontsize=20, fontweight='bold')
  plt.title(f"Actual vs. Predicted Group {l}: {best_n_trees} Trees)", fontsize=22, fontweight='bold')

  plt.legend(fontsize=15)
  plt.show()
  print(f"The r2 score for this fit is {score:.2} and has a Mean Squared Error of {mse:.2}")
  print(f"The cross value score is {np.mean(score_CV):.2}")
  print(f"The mean optical signal for this group is {np.mean(y_pred)}")
  l=l+1
The r2 score for this fit is 0.3 and has a Mean Squared Error of 0.96
The cross value score is 0.85
The mean optical signal for this group is 1.077547095455038
The r2 score for this fit is 0.53 and has a Mean Squared Error of 0.22
The cross value score is 0.7
The mean optical signal for this group is 0.8952648581447339
The r2 score for this fit is 0.39 and has a Mean Squared Error of 3.5
The cross value score is 0.38
The mean optical signal for this group is 1.1847571540733952

Confusion Matrix¶

Since we have used k-menas clustering twice using two different independent variable parameters, and have grouped 3 clusters in each, we have created pseudo labels for our data. In other words, it would seem like astrophysical muon neutrinos with high energy (i.e. high hits/q values) would be in group 0 in the hits/t dependent variable clustering, as well as group 0 in the hits/z clustering. Thus, we can compare how often an index classified in group 0 for the hits/t clustering is inside group 0 for the hits/z clustering and see if our hypotesis matches.

In [24]:
#CITE CHART DESIGN: https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
#Link the cluster group labels from the hits/t and hits/z feature columns with their respective points in the actual raw data set
df_all_events.copy()
df_all_events['cluster_t'] = predicted
df_all_events['cluster_z'] = predictedz
#Compare how often group association matches between both of the kmeans clustering algorithms via a confusions matrix
conf_mat = confusion_matrix(df_all_events['cluster_t'], df_all_events['cluster_z'])
sns.set(font_scale=1.5)#AI useage: makes the font bigger
#This is visualized using the chart design from the link above
sns.heatmap(conf_mat/np.sum(conf_mat), annot=True,
            fmt='.2%', cmap='Blues')
plt.title("Confusion Matrix Between hits/t and hits/z Classifier", fontweight='bold')
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
Out[24]:
(array([0.5, 1.5, 2.5]),
 [Text(0, 0.5, '0'), Text(0, 1.5, '1'), Text(0, 2.5, '2')])

While the group 2 does not work as accuratley, our predictions of group zero are quite accurate and lead to few false guesses which is promising for our method! The point of the confusion matrix was not to validate the accuracy of groups 1 and 2 since electron and tau neutrino signals are so similar, but the fact that our expectations for the observations of muon neutrinos are coincident for the hits/z and hits/t clusters is interesting.

References¶

[1]Confusion Matrix Visualization: https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea

[2]Particle Physics Playgourn: https://sites.google.com/siena.edu/particle-physics-playground/icecube-introduction?authuser=0

[3]Jointplot Documentation: https://seaborn.pydata.org/generated/seaborn.jointplot.html

[4]Waskom, M., Botvinnik, Olga, O'Kane, Drew, Hobson, Paul, Lukauskas, Saulius, Gemperline, David C, … Qalieh, Adel. (2017). mwaskom/seaborn: v0.8.1 (September 2017). Zenodo. https://doi.org/10.5281/zenodo.883859

[5]Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., … Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585, 357–362. https://doi.org/10.1038/s41586-020-2649-2

[6]McKinney, W., & others. (2010). Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference (Vol. 445, pp. 51–56).

[7]Pedregosa, F., Varoquaux, Ga"el, Gramfort, A., Michel, V., Thirion, B., Grisel, O., … others. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825–2830.

[8]ChatGPT

[9]Waskom, M. L. (2021). seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021.

[10]Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830

[11] Hunter, J. D. (2007). Matplotlib: A 2d graphics environment. Computing in Science & Engineering, 9(3):90–95

[12]Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del R´ıo, J. F., Wiebe, M., Peterson, P., G´erard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., and Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585(7825):357–362